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ABSTRACT 

One issue which is central in developing a general purpose FFT subroutine on a dis- 
tributed memory parallel machine is the data distribution. It is possible that different users 
would like to use the FFT routine with different data distributions. Thus there is a need to 
design FFT schemes on distributed memory parallel machines which can support a variety 
of data distributions. In this paper we present an FFT implementation on a distributed 
memory parallel machine which works for a number of data distributions commonly encoun- 
tered in scientific applications. We have also addressed the problem of rearranging the data 
after computing the FFT. We have evaluated the performance of our implementation on a 
distributed memory parallel machine Intel iPSC/ 860 . 


Research for the last two authors was supported in part by the National Aeronautics and Space Ad- 
ministration under NASA Contract Nos. NAS1-18605 and NAS1-19480 while the authors were in residence 
at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research 
Center, Hampton, VA 23681-0001. 
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1 Introduction 


Fourier transform plays an important role in mathematical and numerical analysis. Some 
of the applications which use the Fourier transform are: digital filtering, auto and cross 
correlation, solution of partial differential equations etc. The fast Fourier transform algo- 
rithm (FFT) computes the transform of an n-component sequence of complex numbers in 
O (n log n) time. The implementation of fast Fourier transform on parallel machines has been 
well studied, for example [6, 3, 4, 7, 5, 8, 11]. However, there has been very little effort in 
investigating implementations which can provide a basis for a general purpose subroutines 
on distributed memory machines. 

One issue which is central in developing a general purpose subroutine on a distributed 
memory parallel machine is the data distribution [2]. It is possible that different users wish 
to use the FFT routine with different data distributions. Typically, users determine their 
data distribution based on the over all application requirements, which could be different for 
different users. Thus there is a need to design FFT schemes on distributed memory parallel 
machines which can support a variety of data distributions. 

There are two possible approaches to this problem. The first one is to design an FFT 
subroutine for a specific data distribution which gives optimal performance, along with a set 
of basic communication subroutines to convert a user data distribution to the specific data 
distribution. This approach has the problem of rearranging the user data initially which is 
quite costly on distributed memory parallel machines. The second approach is to design an 
FFT routine which works well for arbitrary data distributions. We have designed a scheme 
which takes the second approach, but only supports a set of common data distributions as 
opposed to arbitrary data distributions. A common set of data distributions, referred to 
as block scattered distributions, has been identified by Walker and Dongarra [14] as very 
useful for distributed memory parallel machines. Block scattered distributions encompass 
the two most common data distributions; the linear data distribution and the scattered data 
distribution. For a one dimensional data set, a block scattered distribution is specified by 
the block size. The data is divided into a set of equal sized blocks. A block j is mapped to 
node ( j mod p), where p is the number of nodes. For example, two data distributions for a 
one dimensional array of 16 data values on a 4 node machine with two different block sizes 
are shown in Figure 1(a) and 1(b). 

In this paper we give an FFT implementation on a distributed memory parallel machine 
for block scattered distributions of data with different block sizes. We have also addressed the 
problem of rearranging the data after computing FFT on the same machine. The motivation 
for rearrangement comes from problems such as solution of partial differential equations using 
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spectral techniques which require the final data distribution to be identical to the initial one. 
Finally, we evaluated the performance of our implementation on the Intel iPSC/860. 

The rest of the paper is organized as follows. In the second section we discuss the issues 
involved in designing a parallel FFT algorithm. In the third section we propose a parallel FFT 
algorithm for variable block size. The fourth section describes the rearrangement problem. 
In the fifth section there is a discussion of the experimental results and in the sixth section 
we give conclusions. 

2 Fourier Transform 

The DFT, X(fc), of an N-point sequence a:(7-) is defined as, 

N - 1 

X(k) = Y x(r)e - j2 * rk/N , 0<k<N-\, (1) 

r=Q 

where j — y/—l. We assume, for convenience, that N is a power of 2. 


FFT Algorithm 

There are two distinct classes of FFT algorithms; namely decimation in time, and decimation 
in frequency. We have used a decimation in frequency FFT algorithm and it is described 
here briefly. (For details one can refer to [1]). To begin with, the N-point sequence z(r) is 
divided into two halves, aq(r) and x 2 ( r ) so that the the transformed sequence can be written 
as 

(Af/2— 1) 

X(2 k) = Y [xi(r) + x 2 (r)p N 2kr (2) 

r=0 

W 2-1) 

X{2k + l) = Y Mr)-X 2 (r)y N u N 2kr , * = 0, l.JV/2 - 1 . (3) 

r - 0 

where ome.ga T N = e -j 2 * r / N , 

These equations represent two N/ 2— point DFT’s of sequences [£i(r) + £ 2 ( r )] an< ^ — 

X 2 (?’)]w]v. The process is then repeatedly applied to the two subsequences. An example 
of an eight point decimation in frequency FFT algorithm is shown in Figure 2(a). The 
fundamental unit of computation that we use in our algorithm is a butterfly (see Figure 
2(b)). 

2.1 Parallel Implementation 

A typical implementation of FFT on a distributed memory machine results in a sequence 
of butterflies at each node interspersed with internode communication. Depending upon the 
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initial distribution, data for some of the butterflies is available locally and for others, off-node 
data are required. There are two approaches for computing butterflies which need off-node 
data. The first approach splits a butterfly between two nodes, and in the second approach 
a complete butterfly is computed on a node. The parallelism in the later case is achieved 
by distributing different butterflies on different nodes. For example, consider a simple case 
of computing two butterflies on a two node machine as shown in Figure 3(a). Notice that 
both butterflies need off-node data. The two approaches are illustrated in Figure 3(b) and 
Figure 3(c) respectively. It is obvious from these figures that the first approach has certain 
disadvantages. These are: 

High communication volume. The first approach requires twice the inter-node commu- 
nication volume as compared to the second approach. 

Unbalanced computational load . The first approach results in additional computation 
on some of the nodes. For example, the multiplication by u in the computation of both the 
butterflies of Figure 3(b) is done on node Pi, unlike the second approach (See Figure 3(c).) 
Extra storage. The first approach requires twice the storage of the second approach. 

For these reasons our parallel implementation of the FFT algorithm is based on the second 
approach. 

3 Algorithm 

As stated earlier, the main feature of our FFT scheme is that it works for block scattered 
data distribution with variable block sizes. That is, the same algorithm can be used for 
different data distributions without any initial rearrangement of the data. The algorithm 
consists of three phases: the first and the third phase compute butterflies for which the data 
is locally available, and the second phase computes butterflies for which off-node data is 
required. As a result, internode communication is required only during the second phase. 
Depending upon the block size the work distribution for the first and third phases will differ. 
In the extreme cases one of these two phases will not be executed. For a block size of one 
we need to execute only the first two phases, while for the block consisting of all the data 
on a node only the last two phases are executed. For all other block sizes all three phases 
of the algorithm are executed. Given the number of processors, the amount of work in the 
second phase remains constant for all block sizes. When all the data on a node forms a single 
block, it must be treated as a special case in phase 2 of the algorithm. This is so because 
to compute butterflies needing off-node data the block must be divided into two sub-blocks, 
which does not happen for other block sizes. 

An FFT algorithm with data size N has log 2 (A r ) distinct stages of computation. Each 
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of these stages compute Nj 2 butterflies. In our FFT scheme, as in most other parallel 
FFT schemes, all the nodes participate in computing a stage by operating on different data 
points. We describe the algorithm by considering an N point FFT on a p-node machine, 
with n ~ N/p as the number of data points mapped per node and b as the block size. The 
distribution of work in the three phases is as follows 

(i) The first log 2 (n/&) stages are computed in the first phase. The butterflies in these 
stages require data available locally from different blocks. 

(ii) The next log 2 (p) stages are computed in the second phase. The butterflies need off- 
node data, hence internode communication takes place. 

(iii) The last log 2 (b ) stages constitute phase three. The butterflies in phase three are again 
computed with local data, but from within a block. 

The algorithm has a computational kernel dftstep which is common to all the three 
phases. The kernel is common to all nodes and computes all the butterflies of a stage 
mapped onto a node. It assumes that the a; 5 s values have been precomputed and arranged 
so that they are available in the right order as needed. A FORTRAN call to the kernel can 
be made as follows: 

call dftstep(a , ic, of f set, groups, dist , wincr) 

where the arguments are: 

a Array of input sequence. 

w Array of trigonometric coefficients to. 

offset The distance between the two elements of a butterfly. 

groups The number of similar sets of butterflies. 

dist The distance between two groups. 

wincr The stride for w. 

The kernel dftstep essentially computes nj 2 butterflies. How these butterflies are formed 
is determined by the three arguments offset, groups and dist. For example consider the 
data in p 0 in Figure 1(a). If a stage requires that the butterflies be formed by combining 
xq with x$ and x\ with then a call to dftstep will have offset = 2, groups = 1 and any 
value in dist. On the other hand, if a stage requires combination of x 0 with X\ and Xg with 
xq then the values of these parameters will be offset = 1, groups = 2 and dist = 2. 

The pseudo code given below describes the FFT algorithm using ’dftstep’, { This code 
is executed on each node } 
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begin {phase 1} 

offset = nj 2 {distance between two points of a butterfly } 
groups = 1 {only one subgroup in the first stage } 
wincr = 1 { stride for lj } 
for i = 1 to log 2 (n/6) do 

dftstep(a,w, offset, groups, offset*2, wincr) 
offset = offset/2 
groups = groups*2 
end for 
end {phase 1} 
begin {phase 2 } 
offset = 7i j 2 
groups = offset /b 

for i = 1 to log 2 (p) do 

{negh is node with i th bit differing} 

{exchange half the data with negh } 
negh = mynode ® 2 d-t+I 

exchange(a(k),negh) {if bit i = 0 then k = 1, else k = n/2+1 } 
dftstep(a,w, offset, groups, b, wincr) 
if(6 = n) then {special case} 
negh = mynode ® 2 d 
exchange(a(k),negh) 
end if 
end for 
end {phase 2 } 
begin {phase 3 } 

for i = 1 to log 2 (6) do 
groups = groups*2 

dftstep(a, w, off set, groups, off set*2, wincr) 
offset = offset /2 
wincr = 2*winc.r 
end for 
end {phase 3 } 


The call to procedure exchange initiates a send first and then posts a receive for incoming 
data. This protocol is followed to ensure concurrent communication [10]. The working of 



the three phases is shown in Figure 4 with the initial data distribution of Figure 1(a). In 
this problem a total of 4 stages are required. The first phase is computed in stage one of the 
algorithm. The butterflies in this stage are formed by the corresponding elements of the two 
blocks. The second phase is computed in stages 2 and 3 which require exchanges of data. 
The fourth stage forms the third phase, which is computed by combining the data within a 
block. 


4 Rearrangement 

In general FFT algorithms generate the resultant sequence in an order different from the 
original one. As a result, the data is in the the wrong node. Also all the data to be sent to 
one node may not be contiguous. In circuit switch or message passing environments, sending 
one data at a time to a node is extremely expensive due to the overheads [9], Hence it is 
desirable that all the data destined for one particular node be collected in one place and 
sent together. In addition, at the destination node the data from different nodes may need 
to be interspersed. In view of these problems we require the following steps in the process 
of rearrangement at each node. 

(i) For each data point on a node determine the destination. The destination includes the 
node number, the block number and the displacement within the block. 

(ii) Collect all the data destined for one node in one set to avoid multiple sends. 

(iii) Send the collected data to the appropriate nodes and receive data being sent by other 
nodes. 

(iv) Place each data item in its destination block with the correct displacement. 

The data rearrangement problem in the worst case is equivalent to the complete exchange 
problem [13,. 12] (see Figure 5(a)). That is each node needs to send data to all the other 
nodes in the machine. However, that is not always the case. For example, the FFT shown 
in Figure 5(b) requires data exchange between only a subset of nodes. Each node sends data 
to and receives data from only two out of four nodes. This situation usually arises when the 
number of blocks mapped per node is of the same order as the number of of data points per 
block. 


5 Experimental results 

We evaluated the performance of our implementation on Intel iPSC/860 for different block 
sizes and for different data sizes. The Intel iPSC/860 is a distributed memory machine 
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which can have up to 128 nodes. Internode communication is done through a hypercube 
interconnection network. We carried out all of our experiments on 64 nodes of a 128 nodes 
machine. The results for different block sizes are summarized in Figure 6. The two curves in 
Figure 6 represent performance per node in Mflops for the complete code and for the FFT 
section alone. Recall that the complete code consists of the FFT computation followed by 
the data rearrangement. These results indicate that while the performance of FFT section 
varies by small amounts, the overall performance shows more noticeable differences. Both 
curves tend to peak in the middle. The variation in the over all performance is due to the 
data rearrangement. Depending on the block size, the data rearrangement may or may not 
be a complete exchange problem. Also, observe from Figure t that the data rearrangement 
time forms a large enough fraction of the total execution time to have influence on the overall 
performance. 

The reason for the slight variation in the FFT section performance lies in the relative 
distribution of work between the three phases of the algorithm. In Figure 8 we have plotted 
the effect of block size on relative distribution of work in the three phases of the FFT section. 
As expected, the fraction of time taken by the first phase is maximum for the smallest block 
size and steadily decreases as the block size increases. The third phase exhibits a reverse 
trend. The region where both these phases have approximately equal work is also the region 
which shows higher performance in Figure 6. Also notice from Figure 8 that the fraction of 
time used in the second phase remains almost constant for all block sizes. The the second 
phase takes more time than the other two since it also involves internode communication. 

The effect of data size on the communication and computation time of the fft section of 
the code is shown in Figure 9. To plot this we picked the best performance for every data 
size. With very small data sizes almost the entire time is taken up by the communication. 
As the data sizes increase, computation starts taking larger fractions of the execution time. 
The computation fraction tends to saturate when the data sizes become sufficiently large. 
To explain the reason for this behavior we consider the communication characteristics of the 
machine 

In Intel iPSC/860 the cost of communication is determined by the expression 

t C omm = 164 + 0.398a + 29.90 

where a is the number of bytes in the message and ft is the distance between two nodes [12]. 
The first term in the equation is the setup overhead. As a increases for fixed ft, the fraction 
of total time used in setup decreases. Notice from Figure 9 that the saturation occurs for 
cv = 2 11 . The value of ft is 1. For this data size the contribution from the overhead term in 

the expression is about 5%. 
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6 Conclusions 


In this paper we gave an FFT implementation on a distributed memory parallel machine 
which works for a number of data distributions commonly encountered in scientific applica- 
tions. 

We evaluated the performance of our implementation on the Intel iPSC/860. The results 
of our experiments indicate that the variation in block sizes has more effect on the perfor- 
mance of the rearrangement section than on the FFT section. For certain block sizes the data 
rearrangement cost is significantly lower than for others. In the FFT section, the variation 
in performance is not significant enough to make the choice of block size a critical issue. On 
a 64 node machine we obtained a peak performance of 203 Mflops, that is 3.17 Mflops per 
node. (This figure does not include the initialization costs which are incurred only once for 
a given size FFT) If we include the data rearrangement, the performance decreases to 2.78 
Mflops per node. 

We believe that the data rearrangement can be made more efficient. At present we have 
adapted one of the complete exchange schemes. It is not clear whether it is the best option 
for the data rearrangement problems arising in FFT or other similar algorithms. Also, the 
performance can be further improved (expected to be about 25%) by using an optimized (at 
assembly level) version of dftstep. 
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Figure 6. The variation in performance as blocksize is increased. 



Figure 7. The fraction of time taken by FFT and Rearrangement 




Figure 9. The relative contributions of computation and communication in the FFT section 
as the datasize is varied. 
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